uniform pac bound
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two setups that has been missing in the literature. We demonstrate the benefits of the new framework for finite-state episodic MDPs with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.
Reviews: Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
The paper defines "Uniform-PAC" where uniformity is over the optimality criterion, eps. It is PAC like in that optimal actions are taken in all but a bounded number of steps. It is also regret like in that the algorithm is eventually good relative to any epsilon---not just one it is told to meet. I thought the discussion of different performance metrics was thorough and informative. I would have liked more intuition about the iterated logarithm idea and its main properties, but I understand that the highly technical stuff had to be expressed in very limited space.
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
Dann, Christoph, Lattimore, Tor, Brunskill, Emma
Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two setups that has been missing in the literature. We demonstrate the benefits of the new framework for finite-state episodic MDPs with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon. Papers published at the Neural Information Processing Systems Conference.